Abstract

Many problems in machine learning and statistics can be formulated as (general- 
ized) eigenproblems. In terms of the associated optimization problem, comput- 
ing linear eigenvectors amounts to finding critical points of a quadratic function 
subject to quadratic constraints. In this paper we show that a certain class of con- 
strained optimization problems with nonquadratic objective and constraints can be 
understood as nonlinear eigenproblems. We derive a generalization of the inverse 
power method which is guaranteed to converge to a nonlinear eigenvector. We 
apply the inverse power method to 1-spectral clustering and sparse PCA which
can naturally be formulated as nonlinear eigenproblems. In both applications we 
achieve state-of-the-art results in terms of solution quality and runtime. Moving 
beyond the standard eigenproblem should be useful also in many other applica- 
tions and our inverse power method can be easily adapted to new problems. 



1 Introduction 

Eigenvalue problems associated to a symmetric and positive semi-definite matrix are quite abundant 
in machine learning and statistics. However, considering the eigenproblem from a variational point 
of view using Courant-Fischer-theory, the objective is a ratio of quadratic functions, which is quite 
restrictive from a modeling perspective. We show in this paper that using a ratio of p-homogeneous 
functions leads quite naturally to a nonlinear eigenvalue problem, associated to a certain nonlin- 
ear operator. Clearly, such a generalization is only interesting if certain properties of the standard 
problem are preserved and efficient algorithms for the computation of nonlinear eigenvectors are 
available. In this paper we present an efficient generalization of the inverse power method (IPM) 
to nonlinear eigenvalue problems and study the relation to the standard problem. While our IPM 
is a general purpose method, we show for two unsupervised learning problems that it can be easily 
adapted to a particular application. 

The first application is spectral clustering [22]. In prior work [6] we proposed p-spectral clustering
based on the graph p-Laplacian, a nonlinear operator on graphs which reduces to the standard graph
Laplacian for p = 2. For p close to one, we obtained much better cuts than standard spectral
clustering, at the cost of higher runtime. Using the new IPM, we efficiently compute eigenvectors of the
1-Laplacian for 1-spectral clustering. Similar to the recent work of [21], we improve considerably
compared to [6] both in terms of runtime and the achieved Cheeger cuts. However, in contrast to the
method suggested in [21], our IPM is guaranteed to converge to an eigenvector of the 1-Laplacian.

The second application is sparse Principal Component Analysis (PCA). The motivation for sparse 
PCA is that the largest PCA component is difficult to interpret as usually all components are nonzero. 
In order to allow a direct interpretation it is therefore desirable to have only a few features with 
nonzero components but which still explain most of the variance. This kind of trade-off has been 
widely studied in recent years, see [16] and references therein. We show that sparse PCA also has a
natural formulation as a nonlinear eigenvalue problem and can be efficiently solved with the IPM.



2 Nonlinear Eigenproblems 

The standard eigenproblem for a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is of the form

$$Af - \lambda f = 0, \tag{1}$$

where $f \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$. It is a well-known result from linear algebra that for symmetric matrices
$A$, the eigenvectors of $A$ can be characterized as critical points of the functional

$$F_{\text{standard}}(f) = \frac{\langle f, Af \rangle}{\|f\|_2^2}. \tag{2}$$

The eigenvectors of A can be computed using the Courant-Fischer Min-Max principle. While the 
ratio of quadratic functions is useful in several applications, it is a severe modeling restriction. This 
restriction however can be overcome using nonlinear eigenproblems. In this paper we consider 
functionals F of the form 

$$F(f) = \frac{R(f)}{S(f)}, \tag{3}$$

where with $\mathbb{R}_+ = \{x \in \mathbb{R} \mid x \geq 0\}$ we assume $R : \mathbb{R}^n \to \mathbb{R}_+$, $S : \mathbb{R}^n \to \mathbb{R}_+$ to be convex,
Lipschitz continuous, even and positively $p$-homogeneous$^1$ with $p \geq 1$. Moreover, we assume that
$S(f) = 0$ if and only if $f = 0$. The condition that $R$ and $S$ are $p$-homogeneous and even will imply
for any eigenvector $v$ that also $\alpha v$ for $\alpha \in \mathbb{R}$ is an eigenvector. It is easy to see that the functional of
the standard eigenvalue problem in Equation (2) is a special case of the general functional in (3).

To gain some intuition, let us first consider the case where $R$ and $S$ are differentiable. Then it holds
for every critical point $f^*$ of $F$,

$$\nabla F(f^*) = 0 \quad \Longleftrightarrow \quad \nabla R(f^*) - \frac{R(f^*)}{S(f^*)} \cdot \nabla S(f^*) = 0.$$

Let $r, s : \mathbb{R}^n \to \mathbb{R}^n$ be the operators defined as $r(f) = \nabla R(f)$, $s(f) = \nabla S(f)$ and $\lambda^* = \frac{R(f^*)}{S(f^*)}$;
we see that every critical point $f^*$ of $F$ satisfies the nonlinear eigenproblem

$$r(f^*) - \lambda^* s(f^*) = 0, \tag{4}$$

which is in general a system of nonlinear equations, as $r$ and $s$ are nonlinear operators. If $R$ and $S$
are both quadratic, $r$ and $s$ are linear operators and one gets back the standard eigenproblem (1).
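As a short worked instance of this reduction (a sketch using the quadratic choice of $R$ and $S$ suggested by (2), with $A$ symmetric):

```latex
\begin{align*}
R(f) &= \langle f, A f\rangle, & r(f) &= \nabla R(f) = 2Af,\\
S(f) &= \|f\|_2^2,             & s(f) &= \nabla S(f) = 2f,\\
r(f^*) - \lambda^* s(f^*) &= 2\,(Af^* - \lambda^* f^*) = 0
  &&\Longleftrightarrow\quad Af^* = \lambda^* f^* .
\end{align*}
```

Both operators are linear here, and $\lambda^* = R(f^*)/S(f^*)$ is the usual Rayleigh quotient.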

Before we proceed to the general nondifferentiable case, we have to introduce some important con- 
cepts from nonsmooth analysis. Note that F is in general nonconvex and nondifferentiable. In the 
following we denote by $\partial F(f)$ the generalized gradient of $F$ at $f$ according to Clarke [10],

$$\partial F(f) = \{\xi \in \mathbb{R}^n \mid F^\circ(f, v) \geq \langle \xi, v \rangle, \text{ for all } v \in \mathbb{R}^n\},$$

where $F^\circ(f, v) = \limsup_{g \to f,\, t \downarrow 0} \frac{F(g + tv) - F(g)}{t}$. In the case where $F$ is convex, $\partial F$ is the
subdifferential of $F$ and $F^\circ(f, v)$ the directional derivative for each $v \in \mathbb{R}^n$. A characterization of critical
points of nonsmooth functionals is as follows.

Definition 2.1 ([8]) A point $f \in \mathbb{R}^n$ is a critical point of $F$ if $0 \in \partial F(f)$.

This generalizes the well-known fact that the gradient of a differentiable function vanishes at a
critical point. We now show that the nonlinear eigenproblem (5) is a necessary condition for a
critical point and in some cases even sufficient. A useful tool is the generalized Euler identity.

Theorem 2.1 ([23]) Let $R : \mathbb{R}^n \to \mathbb{R}$ be a positively $p$-homogeneous and convex continuous function.
Then, for each $x \in \mathbb{R}^n$ and $r^* \in \partial R(x)$ it holds that $\langle x, r^* \rangle = p\, R(x)$.
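The Euler identity can be checked numerically on two simple functionals. The sketch below is only an illustration: the choices $R(x) = \|x\|_1$ (with $p = 1$) and $R(x) = \langle x, Ax \rangle$ (with $p = 2$), as well as all variable names, are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(6)

# p = 1: R(x) = ||x||_1 with subgradient sign(x) (x has no zero entries here).
R1 = np.abs(x).sum()
r1 = np.sign(x)
euler_gap_p1 = abs(x @ r1 - 1 * R1)      # |<x, r*> - p R(x)| with p = 1

# p = 2: R(x) = <x, A x> with A symmetric positive semi-definite, gradient 2 A x.
M = rng.standard_normal((6, 6))
A = M @ M.T
R2 = x @ A @ x
r2 = 2 * A @ x
euler_gap_p2 = abs(x @ r2 - 2 * R2)      # |<x, r*> - p R(x)| with p = 2
```

Both gaps vanish up to floating-point rounding.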



$^1$ A function $G : \mathbb{R}^n \to \mathbb{R}$ is positively homogeneous of degree $p$ if $G(\gamma x) = \gamma^p G(x)$ for all $\gamma \geq 0$.
The next theorem characterizes the relation between nonlinear eigenvectors and critical points of F.



Theorem 2.2 Suppose that $R$, $S$ fulfill the stated conditions. Then a necessary condition for $f^*$
being a critical point of $F$ is

$$0 \in \partial R(f^*) - \lambda^* \partial S(f^*), \quad \text{where } \lambda^* = \frac{R(f^*)}{S(f^*)}. \tag{5}$$

If $S$ is continuously differentiable at $f^*$, then this is also sufficient.

Proof: Let $f^*$ fulfill the general nonlinear eigenproblem in (5), where $r^* \in \partial R(f^*)$, $s^* \in \partial S(f^*)$,
such that $r^* - \lambda^* s^* = 0$. Then by Theorem 2.1,

$$0 = \langle f^*, r^* \rangle - \lambda^* \langle f^*, s^* \rangle = p\, R(f^*) - p\, \lambda^* S(f^*),$$

and thus $\lambda^* = R(f^*)/S(f^*)$. As $R$, $S$ are Lipschitz continuous, we have, see Prop. 2.3.14 in [10],

$$\partial \left(\frac{R}{S}\right)(f) \subseteq \frac{S(f)\, \partial R(f) - R(f)\, \partial S(f)}{S(f)^2}. \tag{6}$$

Thus if $f^*$ is a critical point, that is $0 \in \partial F(f^*)$, then $0 \in \partial R(f^*) - \frac{R(f^*)}{S(f^*)} \partial S(f^*)$ given
that $f^* \neq 0$. Moreover, by Prop. 2.3.14 in [10] we have equality in (6) if $S$ is continuously
differentiable at $f^*$, and thus (5) implies that $f^*$ is a critical point of $F$. □



Finally, the definition of the associated nonlinear operators in the nonsmooth case is a bit tricky as 
r and s can be set-valued. However, as we assume R and S to be Lipschitz, the set where R and S 
are nondifferentiable has measure zero and thus r and s are single-valued almost everywhere. 

3 The inverse power method for nonlinear Eigenproblems 

A standard technique to obtain the smallest eigenvalue of a positive semi-definite symmetric matrix
$A$ is the inverse power method [13]. Its main building block is the fact that the iterative scheme

$$A f^{k+1} = f^k \tag{7}$$

converges to the smallest eigenvector of $A$. Transforming (7) into the optimization problem

$$f^{k+1} = \arg\min_u \tfrac{1}{2} \langle u, Au \rangle - \langle u, f^k \rangle \tag{8}$$

is the motivation for the general IPM. The direct generalization tries to solve

$$0 \in r(f^{k+1}) - s(f^k) \quad \text{or equivalently} \quad f^{k+1} = \arg\min_u R(u) - \langle u, s(f^k) \rangle, \tag{9}$$

where $r(f) \in \partial R(f)$ and $s(f) \in \partial S(f)$. For $p > 1$ this leads directly to Algorithm 2; however,
for $p = 1$ the direct generalization fails. In particular, the ball constraint has to be introduced in
Algorithm 1 as the objective in the optimization problem (9) is otherwise unbounded from below.
(Note that the 2-norm is only chosen for algorithmic convenience.) Moreover, the introduction of
$\lambda^k$ in Algorithm 1 is necessary to guarantee descent whereas in Algorithm 2 it would just yield a
rescaled solution of the problem in the inner loop (called inner problem in the following).

For both methods we show convergence to a solution of (5), which by Theorem 2.2 is a necessary
condition for a critical point of $F$ and often also sufficient. Interestingly, both applications are
naturally formulated as 1-homogeneous problems so that we use in both cases Algorithm 1. Nevertheless,
we state the second algorithm for completeness. Note that we cannot guarantee convergence
to the smallest eigenvector even though our experiments suggest that we often do so. However, as
the method is fast one can afford to run it multiple times with different initializations and use the
eigenvector with smallest eigenvalue.

The inner optimization problem is convex for both algorithms. It turns out that both for 1-spectral
clustering and sparse PCA the inner problem can be solved very efficiently; for sparse PCA it has
even a closed form solution. While we do not yet have results about convergence speed, empirical
observation shows that one usually converges quite quickly to an eigenvector.

Algorithm 1 Computing a nonlinear eigenvector for convex positively p-homogeneous functions R
and S with p = 1

1: Initialization: $f^0$ = random with $\|f^0\| = 1$, $\lambda^0 = F(f^0)$
2: repeat
3:   $f^{k+1} = \arg\min_{\|u\|_2 \leq 1} \{ R(u) - \lambda^k \langle u, s(f^k) \rangle \}$ where $s(f^k) \in \partial S(f^k)$
4:   $\lambda^{k+1} = R(f^{k+1}) / S(f^{k+1})$
5: until $\frac{|\lambda^{k+1} - \lambda^k|}{\lambda^k} < \epsilon$
6: Output: eigenvalue $\lambda^{k+1}$ and eigenvector $f^{k+1}$.

Algorithm 2 Computing a nonlinear eigenvector for convex positively p-homogeneous functions R
and S with p > 1

1: Initialization: $f^0$ = random, $\lambda^0 = F(f^0)$
2: repeat
3:   $g^{k+1} = \arg\min_u \{ R(u) - \langle u, s(f^k) \rangle \}$ where $s(f^k) \in \partial S(f^k)$
4:   $f^{k+1} = g^{k+1} / S(g^{k+1})^{1/p}$
5:   $\lambda^{k+1} = R(f^{k+1}) / S(f^{k+1})$
6: until $\frac{|\lambda^{k+1} - \lambda^k|}{\lambda^k} < \epsilon$
7: Output: eigenvalue $\lambda^{k+1}$ and eigenvector $f^{k+1}$.
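To make the structure of Algorithm 1 concrete, the following minimal sketch runs it on a toy pair of 1-homogeneous functionals, $R(f) = \|f\|_1$ and $S(f) = \|f\|_2$. This pair is an assumption chosen for illustration only (it is not one of the applications of this paper); for it, the inner problem is solved by a normalized soft-thresholding step. The function name `ipm_p1_toy` and its parameters are hypothetical.

```python
import numpy as np

def ipm_p1_toy(f0, eps=1e-10, max_iter=100):
    """Algorithm 1 for the toy pair R(f) = ||f||_1, S(f) = ||f||_2 (illustrative assumption)."""
    f = f0 / np.linalg.norm(f0)
    lam = np.abs(f).sum() / np.linalg.norm(f)       # lambda^0 = F(f^0)
    history = [lam]
    for _ in range(max_iter):
        s = f / np.linalg.norm(f)                   # s(f^k): gradient of the 2-norm at f^k != 0
        # Inner problem: min over ||u||_2 <= 1 of ||u||_1 - lam * <u, s>.
        # For this toy choice the minimizer is a normalized soft-thresholding of s.
        w = np.maximum(lam * np.abs(s) - 1.0, 0.0)
        if not np.any(w > 0):                       # optimal value is zero: f^k is an eigenvector
            break
        f = np.sign(s) * w / np.linalg.norm(w)
        lam_new = np.abs(f).sum() / np.linalg.norm(f)
        history.append(lam_new)
        converged = abs(lam_new - lam) / lam < eps  # stopping rule of step 5
        lam = lam_new
        if converged:
            break
    return lam, f, history

rng = np.random.default_rng(0)
lam, f, history = ipm_p1_toy(rng.standard_normal(8))
```

In this toy setting the recorded values of $F$ are non-increasing, in line with the descent property shown for the IPM, and a generic start ends at a vector with a single nonzero entry, where $F = 1$.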

To the best of our knowledge, neither of the suggested methods has been considered before. In [5] the authors propose
an inverse power method specially tailored towards the continuous p-Laplacian for p > 1, which
can be seen as a special case of Algorithm 2. In [16] a generalized power method has been proposed
which will be discussed in Section 5. Finally, both methods can be easily adapted to compute the
largest nonlinear eigenvalue, which however we have to omit due to space constraints.
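As a sanity check of Algorithm 2, the sketch below instantiates it for the quadratic case $R(f) = \langle f, Af \rangle$, $S(f) = \|f\|_2^2$ with $p = 2$, where the inner problem reduces to a linear solve and the scheme coincides with the standard inverse power method up to scaling. The function name `ipm_p2_quadratic`, the test matrix and the tolerances are assumptions made for this example.

```python
import numpy as np

def ipm_p2_quadratic(A, f0, eps=1e-12, max_iter=1000):
    """Algorithm 2 for R(f) = <f, A f>, S(f) = ||f||_2^2 (p = 2), A symmetric positive definite."""
    f = f0 / np.linalg.norm(f0)                  # scale so that S(f^0) = 1
    lam = f @ A @ f / (f @ f)                    # lambda^0 = F(f^0)
    for _ in range(max_iter):
        s = 2.0 * f                              # s(f^k) = gradient of S at f^k
        # Inner problem: argmin_u <u, A u> - <u, s>; stationarity gives 2 A u = s.
        g = np.linalg.solve(2.0 * A, s)
        f = g / np.sqrt(g @ g)                   # f^{k+1} = g^{k+1} / S(g^{k+1})^{1/p}
        lam_new = f @ A @ f / (f @ f)
        converged = abs(lam_new - lam) / lam < eps
        lam = lam_new
        if converged:
            break
    return lam, f

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ Q.T   # known spectrum, used only for the check
lam, f = ipm_p2_quadratic(A, rng.standard_normal(5))
```

For this matrix the returned `lam` approaches the smallest eigenvalue 1. In the general nonlinear case only convergence to some eigenvector is guaranteed.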

Lemma 3.1 The sequences $f^k$ produced by Alg. 1 and 2 satisfy $F(f^k) > F(f^{k+1})$ for all $k \geq 0$ or
the sequences terminate.

Theorem 3.1 The sequences $f^k$ produced by Algorithms 1 and 2 converge to an eigenvector $f^*$
with eigenvalue $\lambda^* \in [0, F(f^0)]$ in the sense that it solves the nonlinear eigenproblem (5). If $S$ is
continuously differentiable at $f^*$, then $F$ has a critical point at $f^*$.

Throughout the proofs, we use the notation $\Phi_{f^k}(u) = R(u) - \lambda^k \langle u, s(f^k) \rangle$ and $\Psi_{f^k}(u) = R(u) -
\langle u, s(f^k) \rangle$ for the objectives of the inner problems in Algorithms 1 and 2, respectively.

Proof of Lemma 3.1 for Algorithm 1: First note that the optimal value of the inner problem is
non-positive as $\Phi_{f^k}(0) = 0$. Moreover, as $\Phi_{f^k}$ is 1-homogeneous, the minimum of $\Phi_{f^k}$ is always
attained at the boundary of the constraint set. Thus any $f^k$ fulfills $\|f^k\|_2 = 1$ and thus is feasible,
and

$$\min_{\|f\|_2 \leq 1} \Phi_{f^k}(f) \leq \Phi_{f^k}(f^k) = R(f^k) - \lambda^k \langle f^k, s(f^k) \rangle = R(f^k) - F(f^k) \cdot S(f^k) = 0,$$

where we used $\langle f^k, s(f^k) \rangle = S(f^k)$ from Theorem 2.1. If the optimal value is zero, then $f^k$ is a
possible minimizer, the sequence terminates and $f^k$ is an eigenvector, see the proof of Theorem 3.1
for Algorithm 1. Otherwise the optimal value is negative and at the optimal point $f^{k+1}$ we get
$R(f^{k+1}) < \lambda^k \langle f^{k+1}, s(f^k) \rangle$. The definition of the subdifferential $s(f^k)$ together with the
1-homogeneity of $S$ yields

$$S(f^{k+1}) \geq S(f^k) + \langle f^{k+1} - f^k, s(f^k) \rangle = \langle f^{k+1}, s(f^k) \rangle,$$

and finally $F(f^{k+1}) = \frac{R(f^{k+1})}{S(f^{k+1})} < \lambda^k = F(f^k)$. □
Proof of Theorem 3.1 for Algorithm 1: By Lemma 3.1 the sequence $F(f^k)$ is monotonically
decreasing. By assumption $S$ and $R$ are nonnegative and hence $F$ is bounded below by zero. Thus
we have convergence towards a limit

$$\lambda^* = \lim_{k \to \infty} F(f^k).$$

Note that $\|f^k\| \leq 1$ for every $k$, thus the sequence $f^k$ is contained in a compact set, which implies
that there exists a subsequence $f^{k_j}$ converging to some element $f^*$. As the sequence $F(f^{k_j})$ is a
subsequence of a convergent sequence, it has to converge towards the same limit, hence also

$$\lim_{j \to \infty} F(f^{k_j}) = \lambda^*.$$

As shown before, the objective of the inner optimization problem is nonpositive at the optimal point.
Assume now that $\min_{\|f\|_2 \leq 1} \Phi_{f^*}(f) < 0$. Then the vector $f^{**} = \arg\min_{\|f\|_2 \leq 1} \Phi_{f^*}(f)$ satisfies

$$R(f^{**}) < \lambda^* \langle f^{**}, s(f^*) \rangle = \lambda^* \left( S(f^*) + \langle f^{**} - f^*, s(f^*) \rangle \right) \leq \lambda^* S(f^{**}),$$

where we used the definition of the subdifferential and the 1-homogeneity of $S$. Hence

$$F(f^{**}) < \lambda^* = F(f^*),$$

which is a contradiction to the fact that the sequence $F(f^k)$ has converged to $\lambda^*$. Thus we must
have $\min_{\|f\|_2 \leq 1} \Phi_{f^*}(f) = 0$, i.e. the function $\Phi_{f^*}(f)$ is nonnegative in the unit ball. Using the fact
that for any $\alpha \geq 0$,

$$\Phi_{f^*}(\alpha f) = \alpha\, \Phi_{f^*}(f),$$

we can even conclude that the function $\Phi_{f^*}(f)$ is nonnegative everywhere, and thus $\min_f \Phi_{f^*}(f) = 0$.
Note that $\Phi_{f^*}(f^*) = 0$, which implies that $f^*$ is a global minimizer of $\Phi_{f^*}$ and hence

$$0 \in \partial \Phi_{f^*}(f^*) = \partial R(f^*) - \lambda^* \partial S(f^*),$$

which implies that $f^*$ is an eigenvector with eigenvalue $\lambda^*$. Note that this argument was independent
of the choice of the subsequence, thus every convergent subsequence converges to an eigenvector
with the same eigenvalue $\lambda^*$. Clearly we have $\lambda^* \leq F(f^0)$. □

The following lemma is useful in the convergence proof of Algorithm 2.

Lemma 3.2 Let $R$ be a convex, positively $p$-homogeneous function with $p \geq 1$. Then for any
$x \in \mathbb{R}^n$, $t \geq 0$ and any $r^* \in \partial R(x)$ we have $t^{p-1} r^* \in \partial R(tx)$.

Proof: Using the definition of the subgradient, we have for any $y \in \mathbb{R}^n$ and any $t \geq 0$,

$$t^p R(y) \geq t^p R(x) + t^p \langle r^*, y - x \rangle.$$

Using the $p$-homogeneity of $R$, we can rewrite this as

$$R(ty) \geq R(tx) + \langle t^{p-1} r^*, ty - tx \rangle,$$

which implies $t^{p-1} r^* \in \partial R(tx)$. □

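The following Proposition generalizes a result by Zarantonello [24].

Proposition 3.1 Let $R : \mathbb{R}^n \to \mathbb{R}$ be a convex, continuous, positively $p$-homogeneous and even
functional and $\partial R(f)$ its subdifferential at $f$. Then it holds for any $f, g \in \mathbb{R}^n$ and $r(f) \in \partial R(f)$, $r(g) \in \partial R(g)$,

$$|\langle r(f), g \rangle| \leq \langle r(f), f \rangle^{1 - \frac{1}{p}} \langle r(g), g \rangle^{\frac{1}{p}} = p \cdot R(f)^{1 - \frac{1}{p}} R(g)^{\frac{1}{p}}.$$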

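Proof: First observe that for any $k$ points $x_0, \dots, x_{k-1} \in \mathbb{R}^n$, the subdifferential inequality yields

$$R(x_l) \geq R(x_{l-1}) + \langle r(x_{l-1}), x_l - x_{l-1} \rangle, \quad \forall\, 1 \leq l \leq k - 1,$$
$$R(x_0) \geq R(x_{k-1}) + \langle r(x_{k-1}), x_0 - x_{k-1} \rangle,$$

and hence, by summing up,

$$\langle r(x_0), x_1 - x_0 \rangle + \dots + \langle r(x_{k-2}), x_{k-1} - x_{k-2} \rangle + \langle r(x_{k-1}), x_0 - x_{k-1} \rangle \leq 0. \tag{10}$$

Let now $f, g \in \mathbb{R}^n$, and $r(f) \in \partial R(f)$, $r(g) \in \partial R(g)$. We construct a set of $2m$ points
$x_0, \dots, x_{2m-1}$ in $\mathbb{R}^n$, where $m \in \mathbb{N}$, as follows:

$$x_i = \begin{cases} \frac{i}{m} f, & 0 \leq i \leq m - 1, \\ \frac{2m - 1 - i}{m} g, & m \leq i \leq 2m - 1. \end{cases}$$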



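By Lemma 3.2 for all $i \in \{0, \dots, 2m - 1\}$ there exists an $r^*(x_i) \in \partial R(x_i)$ s.t.

$$r^*(x_i) = \begin{cases} \left(\frac{i}{m}\right)^{p-1} r(f), & 0 \leq i \leq m - 1, \\ \left(\frac{2m - 1 - i}{m}\right)^{p-1} r(g), & m \leq i \leq 2m - 1. \end{cases}$$

Eq. (10) now yields (the last term of (10) vanishes because $x_0 = x_{2m-1} = 0$)

$$\sum_{i=0}^{m-2} \left(\frac{i}{m}\right)^{p-1} \frac{1}{m} \langle r(f), f \rangle + \left(\frac{m-1}{m}\right)^{p} \langle r(f), g - f \rangle - \sum_{i=m}^{2m-2} \left(\frac{2m-1-i}{m}\right)^{p-1} \frac{1}{m} \langle r(g), g \rangle \leq 0,$$

which simplifies to

$$\left( \frac{1}{m} \sum_{i=0}^{m-2} \left(\frac{i}{m}\right)^{p-1} - \left(\frac{m-1}{m}\right)^{p} \right) \langle r(f), f \rangle - \left( \frac{1}{m} \sum_{i=1}^{m-1} \left(\frac{i}{m}\right)^{p-1} \right) \langle r(g), g \rangle + \left(\frac{m-1}{m}\right)^{p} \langle r(f), g \rangle \leq 0. \tag{11}$$

By letting $m \to \infty$ we obtain for the two sums

$$\lim_{m \to \infty} \frac{1}{m} \sum_{i=0}^{m-2} \left(\frac{i}{m}\right)^{p-1} = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m-1} \left(\frac{i}{m}\right)^{p-1} = \int_0^1 t^{p-1}\, dt = \frac{1}{p},$$

while $\left(\frac{m-1}{m}\right)^p \to 1$. Hence in total in the limit $m \to \infty$ Eq. (11) becomes

$$\langle r(f), g \rangle - \left(1 - \frac{1}{p}\right) \langle r(f), f \rangle - \frac{1}{p} \langle r(g), g \rangle \leq 0.$$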

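As the above inequality holds for all $f, g \in \mathbb{R}^n$, clearly we can now perform the substitution
$f \to t^{-1} f$, $g \to t^{p-1} g$, $r(f) \to t^{-(p-1)} r(f)$, $r(g) \to t^{(p-1)^2} r(g)$, where $t \in \mathbb{R}_+$, which gives

$$\langle r(f), g \rangle - \left(1 - \frac{1}{p}\right) \langle r(f), f \rangle\, t^{-p} - \frac{1}{p}\, t^{p(p-1)} \langle r(g), g \rangle \leq 0. \tag{12}$$

A local optimum with respect to $t$ of the left side satisfies the necessary condition

$$0 = (p - 1)\, t^{-p-1} \langle r(f), f \rangle - (p - 1)\, t^{p(p-1)-1} \langle r(g), g \rangle = t^{-p-1} (p - 1) \left( \langle r(f), f \rangle - t^{p^2} \langle r(g), g \rangle \right),$$

which implies that

$$t = \left( \frac{\langle r(f), f \rangle}{\langle r(g), g \rangle} \right)^{\frac{1}{p^2}}.$$

Plugging this into (12) yields

$$0 \geq \langle r(f), g \rangle - \left(1 - \frac{1}{p}\right) \langle r(g), g \rangle^{\frac{1}{p}} \langle r(f), f \rangle^{1 - \frac{1}{p}} - \frac{1}{p} \langle r(f), f \rangle^{1 - \frac{1}{p}} \langle r(g), g \rangle^{\frac{1}{p}} = \langle r(f), g \rangle - \langle r(f), f \rangle^{1 - \frac{1}{p}} \langle r(g), g \rangle^{\frac{1}{p}}.$$

By the homogeneity of $R$ we then have

$$\langle r(f), g \rangle \leq \langle r(f), f \rangle^{1 - \frac{1}{p}} \langle r(g), g \rangle^{\frac{1}{p}} = p \cdot R(f)^{1 - \frac{1}{p}} R(g)^{\frac{1}{p}}.$$

Finally, note that we can replace the left side by its absolute value since replacing $g$ with $-g$ yields

$$\langle r(f), -g \rangle \leq p \cdot R(f)^{1 - \frac{1}{p}} R(-g)^{\frac{1}{p}} = p \cdot R(f)^{1 - \frac{1}{p}} R(g)^{\frac{1}{p}},$$

where we used the fact that $R$ is even. □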
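The Hölder-type inequality can be checked numerically on a concrete functional. The sketch below assumes $R(f) = \|f\|_p^p$ with $p = 3$, which is convex, even and positively $p$-homogeneous; this choice and all names are made for illustration only. In this case the bound reduces to the classical Hölder inequality.

```python
import numpy as np

p = 3.0

def R(f):
    """R(f) = ||f||_p^p, an illustrative p-homogeneous, convex and even functional."""
    return np.sum(np.abs(f) ** p)

def r(f):
    """Gradient of R."""
    return p * np.sign(f) * np.abs(f) ** (p - 1)

rng = np.random.default_rng(0)
worst_slack = np.inf
for _ in range(1000):
    f, g = rng.standard_normal(5), rng.standard_normal(5)
    lhs = abs(r(f) @ g)
    rhs = p * R(f) ** (1 - 1 / p) * R(g) ** (1 / p)
    worst_slack = min(worst_slack, rhs - lhs)   # should never be negative
```

Over the random trials the slack `rhs - lhs` stays nonnegative up to rounding.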



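Proof of Lemma 3.1 for Algorithm 2: Note that as $R(u) \geq 0$, the minimum of the objective of
the inner problem is attained for some $u$ with $\langle u, s(f^k) \rangle > 0$. Choose $u$ such that $\langle u, s(f^k) \rangle > 0$.
Then we minimize $\Psi_{f^k}$ on the ray $tu$, $t \geq 0$. We have

$$\Psi_{f^k}(tu) = R(tu) - \langle tu, s(f^k) \rangle = t^p R(u) - t \langle u, s(f^k) \rangle$$

and hence

$$\frac{\partial}{\partial t} \Psi_{f^k}(tu) = p\, t^{p-1} R(u) - \langle u, s(f^k) \rangle,$$

and thus the minimum is attained at $t^*(u) = \left( \frac{\langle u, s(f^k) \rangle}{p\, R(u)} \right)^{\frac{1}{p-1}}$, and

$$\Psi_{f^k}(t^*(u) u) = t^*(u)^p R(u) - t^*(u) \langle u, s(f^k) \rangle = (1 - p) \left( \frac{\langle u, s(f^k) \rangle^p}{p^p R(u)} \right)^{\frac{1}{p-1}}.$$

Assume there exists $u$ that satisfies $\Psi_{f^k}(u) < \Psi_{f^k}(\tilde{f})$ where $\tilde{f} = F(f^k)^{\frac{1}{1-p}} f^k$. Hence, also
$\Psi_{f^k}(t^*(u) u) < \Psi_{f^k}(\tilde{f})$, which implies

$$(1 - p) \left( \frac{\langle u, s(f^k) \rangle^p}{p^p R(u)} \right)^{\frac{1}{p-1}} < F(f^k)^{\frac{p}{1-p}} R(f^k) - F(f^k)^{\frac{1}{1-p}} \langle f^k, s(f^k) \rangle = F(f^k)^{\frac{1}{1-p}} (1 - p),$$

where we used the fact that $\langle f^k, s(f^k) \rangle = p\, S(f^k)$ and $S(f^k) = 1$. Rearranging, we obtain

$$F(f^k) > \frac{p^p R(u)}{\langle u, s(f^k) \rangle^p}.$$

Using the Hölder-type inequality of Proposition 3.1 and $S(f^k) = 1$, we obtain

$$\langle u, s(f^k) \rangle \leq p\, S(f^k)^{1 - \frac{1}{p}} S(u)^{\frac{1}{p}} = p\, S(u)^{\frac{1}{p}},$$

which gives $F(f^k) > F(u)$. Let now $f^*$ be the minimizer of $\Psi_{f^k}$. Then $f^*$ satisfies $\Psi_{f^k}(f^*) \leq
\Psi_{f^k}(\tilde{f})$. If equality holds then $\tilde{f} = F(f^k)^{\frac{1}{1-p}} f^k$ is a minimizer of the inner problem and the
sequence terminates. In this case $f^k$ is an eigenvector, see the proof of Theorem 3.1 for Algorithm 2.
Otherwise $\Psi_{f^k}(f^*) < \Psi_{f^k}(\tilde{f})$ and thus $u = f^*$ fulfills the above assumption and we get $F(f^k) >
F(f^*)$, as claimed. □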

Proof of Theorem 3.1 for Algorithm 2: Note that as $F(f) \geq 0$, the sequence $F(f^k)$ is bounded
from below, and by Lemma 3.1 it is monotonically decreasing and thus converges to some $\lambda^* \in
[0, F(f^0)]$. Moreover, $S(f^k) = 1$ for all $k$. As $S$ is continuous it attains its minimum $m$ on the unit
sphere in $\mathbb{R}^n$. By assumption $m > 0$. We obtain

$$1 = S(f^k) = \|f^k\|^p\, S\!\left( \frac{f^k}{\|f^k\|} \right) \geq m\, \|f^k\|^p \quad \Longrightarrow \quad \|f^k\| \leq \left( \frac{1}{m} \right)^{\frac{1}{p}}.$$

Thus the sequence $f^k$ is bounded and there exists a convergent subsequence $f^{k_j}$. Clearly,
$\lim_{j \to \infty} F(f^{k_j}) = \lim_{k \to \infty} F(f^k) = \lambda^*$. Let now $f^* = \lim_{j \to \infty} f^{k_j}$, and suppose that there
exists $u \in \mathbb{R}^n$ with $\Psi_{f^*}(u) < \Psi_{f^*}(\tilde{f})$ where $\tilde{f} = F(f^*)^{\frac{1}{1-p}} f^*$. Then, analogously to the proof of
Lemma 3.1 one can conclude that $F(u) < F(f^*) = \lambda^*$, which contradicts the fact that $F(f^{k_j})$ has
$\lambda^*$ as its limit. Thus $\tilde{f}$ is a minimizer of $\Psi_{f^*}$, which implies

$$0 \in \partial \Psi_{f^*}\!\left( F(f^*)^{\frac{1}{1-p}} f^* \right) = \partial R\!\left( F(f^*)^{\frac{1}{1-p}} f^* \right) - s(f^*) = \left( F(f^*)^{\frac{1}{1-p}} \right)^{p-1} \partial R(f^*) - s(f^*) = \frac{1}{F(f^*)} \left( \partial R(f^*) - F(f^*)\, s(f^*) \right),$$

so that $f^*$ is an eigenvector with eigenvalue $\lambda^*$. As this argument was independent of the
subsequence, any convergent subsequence of $f^k$ converges towards an eigenvector with eigenvalue $\lambda^*$. □



Practical implementation: By the proof of Lemma 3.1, descent in $F$ is not only guaranteed for
the optimal solution of the inner problem, but for any vector $u$ which has inner objective value
$\Phi_{f^k}(u) < 0 = \Phi_{f^k}(f^k)$ for Alg. 1 and $\Psi_{f^k}(u) < \Psi_{f^k}(F(f^k)^{\frac{1}{1-p}} f^k)$ in the case of Alg. 2. This
has two important practical implications. First, for the convergence of the IPM, it is sufficient to use
a vector $u$ satisfying the above conditions instead of the optimal solution of the inner problem. In
particular, in an early stage where one is far away from the limit, it makes no sense to invest much
effort to solve the inner problem accurately. Second, if the inner problem is solved by a descent
method, a good initialization for the inner problem at step $k + 1$ is given by $f^k$ in the case of Alg. 1
and $F(f^k)^{\frac{1}{1-p}} f^k$ in the case of Alg. 2, as descent in $F$ is guaranteed after one step.

4 Application 1: 1-spectral clustering and Cheeger cuts

Spectral clustering is a graph-based clustering method (see [22] for an overview) based on a
relaxation of the NP-hard problem of finding the optimal balanced cut of an undirected graph. The
spectral relaxation has as its solution the second eigenvector of the graph Laplacian and the final
partition is found by optimal thresholding. While usually spectral clustering is understood as relaxation
of the so-called ratio/normalized cut, it can be equally seen as relaxation of the ratio/normalized
Cheeger cut, see [6]. Given a weighted undirected graph with vertex set $V$ and weight matrix $W$,
the ratio Cheeger cut (RCC) of a partition $(C, \overline{C})$, where $C \subset V$ and $\overline{C} = V \setminus C$, is defined as

$$\mathrm{RCC}(C, \overline{C}) := \frac{\mathrm{cut}(C, \overline{C})}{\min\{|C|, |\overline{C}|\}}, \quad \text{where } \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij},$$

where we assume in the following that the graph is connected. Due to limited space the normalized
version is omitted, but the proposed IPM can be adapted to this case. In [6] we proposed p-spectral
clustering, a generalization of spectral clustering based on the second eigenvector of the nonlinear
graph p-Laplacian (the graph Laplacian is recovered for $p = 2$). The main motivation was
the relation between the optimal Cheeger cut $h_{\mathrm{RCC}} = \min_{C \subset V} \mathrm{RCC}(C, \overline{C})$ and the Cheeger cut
$h^*_{\mathrm{RCC}}$ obtained by optimal thresholding the second eigenvector of the p-Laplacian, see [6],

$$\forall\, p > 1, \quad \frac{h_{\mathrm{RCC}}}{\max_{i \in V} d_i} \leq \frac{h^*_{\mathrm{RCC}}}{\max_{i \in V} d_i} \leq p \left( \frac{h_{\mathrm{RCC}}}{\max_{i \in V} d_i} \right)^{\frac{1}{p}},$$

where $d_i = \sum_{j \in V} w_{ij}$ denotes the degree of vertex $i$. While the inequality is quite loose for spectral
clustering ($p = 2$), it becomes tight for $p \to 1$. Indeed in [6] much better cuts than standard spectral
clustering were obtained, at the expense of higher runtime. In [21] the idea was taken up and they
considered directly the variational characterization of the ratio Cheeger cut, see also [9],

$$h_{\mathrm{RCC}} = \min_{f \text{ nonconstant}} \frac{\frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f - \mathrm{median}(f) \mathbf{1}\|_1} = \min_{\substack{f \text{ nonconstant} \\ \mathrm{median}(f) = 0}} \frac{\frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f\|_1}. \tag{13}$$

In [21] they proposed a minimization scheme based on the Split Bregman method [12]. Their method
produces comparable cuts to the ones in [6], while being computationally much more efficient.
However, they could not provide any convergence guarantee for their method.

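In this paper we consider the functional associated to the 1-Laplacian $\Delta_1$,

$$F_1(f) = \frac{\frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f\|_1} = \frac{\langle f, \Delta_1 f \rangle}{\|f\|_1}, \tag{14}$$

where

$$(\Delta_1 f)_i = \Big\{ \sum_{j=1}^n w_{ij} u_{ij} \;\Big|\; u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j) \Big\} \quad \text{and} \quad \mathrm{sign}(x) = \begin{cases} -1, & x < 0, \\ [-1, 1], & x = 0, \\ 1, & x > 0, \end{cases}$$

and study its associated nonlinear eigenproblem $0 \in \Delta_1 f - \lambda\, \mathrm{sign}(f)$.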

Proposition 4.1 Any non-constant eigenvector $f^*$ of the 1-Laplacian has median zero. Moreover,
let $\lambda_2$ be the second eigenvalue of the 1-Laplacian, then if $G$ is connected it holds $\lambda_2 = h_{\mathrm{RCC}}$.

Proof: The subdifferential of the numerator of $F_1$ can be computed as

$$\partial \Big( \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j| \Big)_i = \Big\{ \sum_{j=1}^n w_{ij} u_{ij} \;\Big|\; u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j) \Big\},$$

where we use the set-valued mapping

$$\mathrm{sign}(x) = \begin{cases} -1, & x < 0, \\ [-1, 1], & x = 0, \\ 1, & x > 0. \end{cases}$$

Moreover, the subdifferential of the denominator of $F_1$ is

$$\partial \|f\|_1 = \mathrm{sign}(f).$$

Note that, assuming that the graph is connected, any non-constant eigenvector $f^*$ must have $\lambda^* > 0$.
Thus if $f^*$ is an eigenvector of the 1-Laplacian, there must exist $u_{ij}$ with $u_{ij} = -u_{ji}$ and $u_{ij} \in
\mathrm{sign}(f^*_i - f^*_j)$ and $\alpha_i$ with $\alpha_i \in \mathrm{sign}(f^*_i)$ such that

$$0 = \sum_{j=1}^n w_{ij} u_{ij} - \lambda^* \alpha_i.$$

Summing over $i$ yields, due to the anti-symmetry of $u_{ij}$, $\sum_i \alpha_i = |f^*_+| - |f^*_-| + \sum_{f^*_i = 0} \alpha_i = 0$,
where $|f^*_+|$, $|f^*_-|$ are the cardinalities of the positive and negative part of $f^*$ and $|f^*_0|$ is the number
of components with value zero. Thus we get

$$\big|\, |f^*_+| - |f^*_-| \,\big| \leq |f^*_0|,$$

which implies with $|f^*_+| + |f^*_-| + |f^*_0| = |V|$ that $|f^*_+| \leq \frac{|V|}{2}$ and $|f^*_-| \leq \frac{|V|}{2}$. Thus the median of $f^*$
is zero if $|V|$ is odd. If $|V|$ is even, the median is non-unique and is contained in $[\max f^*_-, \min f^*_+]$
which contains zero.

If the graph is connected, the only eigenvector corresponding to the first eigenvalue $\lambda_1 = 0$
of the 1-Laplacian is the constant one. As all non-constant eigenvectors have median zero, it
follows with Equation (13) that $\lambda_2 \geq h_{\mathrm{RCC}}$. For the other direction, we have to use the algorithm
we present in the following and some subsequent results. By Lemma 4.2 there exists a vector
$f^* = \mathbf{1}_C$ with $|C| \leq |\overline{C}|$ such that $F_1(f^*) = h_{\mathrm{RCC}}$. Obviously, $f^*$ is non-constant and has
median zero and thus can be used as initial point $f^0$ for Algorithm 3. By Lemma 4.1, starting with
$f^0 = f^*$ the sequence either terminates and the current iterate $f^0$ is an eigenvector or one finds an
$f^1$ with $F_1(f^1) < F_1(f^0)$, where $f^1$ has median zero. Suppose that there exists such an $f^1$, then
$F_1(f^1) < F_1(f^0) = \min_{f \text{ nonconstant},\, \mathrm{median}(f) = 0} F_1(f)$, which is a contradiction. Therefore the sequence has
to terminate and thus by the argument in the proof of Theorem 4.1 the corresponding iterate is an
eigenvector. Thus we get $h_{\mathrm{RCC}} \geq \lambda_2$ and thus with $\lambda_2 \geq h_{\mathrm{RCC}}$ we arrive at the desired result. □



For the computation of the second eigenvector we have to modify the IPM which is discussed in the 
next section. 
4.1 Modification of the IPM for computing the second eigenvector of the 1-Laplacian

The direct minimization of (14) would be compatible with the IPM, but the global minimizer is the
first eigenvector which is constant. For computing the second eigenvector note that, unlike in the
case $p = 2$, we cannot simply project on the space orthogonal to the constant eigenvector, since
mutual orthogonality of the eigenvectors does not hold in the nonlinear case.

Algorithm 3 is a modification of Algorithm 1 which computes a nonconstant eigenvector of the
1-Laplacian. The notation $|f^{k+1}_+|$, $|f^{k+1}_-|$ and $|f^{k+1}_0|$ refers to the cardinality of positive, negative and
zero elements, respectively. Note that Algorithm 1 requires in each step the computation of some
subgradient $s(f^k) \in \partial S(f^k)$, whereas in Algorithm 3 the subgradient $v^k$ has to satisfy $\langle v^k, \mathbf{1} \rangle = 0$.
This condition ensures that the inner objective is invariant under addition of a constant and thus
not affected by the subtraction of the median. In contrast to [21] we can prove convergence to a
nonconstant eigenvector of the 1-Laplacian. However, we cannot guarantee convergence to the
second eigenvector. Thus we recommend using multiple random initializations and keeping the result
which achieves the best ratio Cheeger cut.



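Algorithm 3 Computing a nonconstant 1-eigenvector of the graph 1-Laplacian

1: Input: weight matrix $W$
2: Initialization: nonconstant $f^0$ with $\mathrm{median}(f^0) = 0$ and $\|f^0\|_1 = 1$, accuracy $\epsilon$
3: repeat
4:   $g^{k+1} = \arg\min_{\|f\|_2^2 \leq 1} \big\{ \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j| - \lambda^k \langle f, v^k \rangle \big\}$
5:   $f^{k+1} = g^{k+1} - \mathrm{median}(g^{k+1})$
6:   $v^{k+1}_i = \mathrm{sign}(f^{k+1}_i)$ if $f^{k+1}_i \neq 0$, and $v^{k+1}_i = -\frac{|f^{k+1}_+| - |f^{k+1}_-|}{|f^{k+1}_0|}$ if $f^{k+1}_i = 0$
7:   $\lambda^{k+1} = F_1(f^{k+1})$
8: until $\frac{|\lambda^{k+1} - \lambda^k|}{\lambda^k} < \epsilon$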
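The subgradient selection in step 6 can be sketched as follows. The function name `subgradient_v` and the example vector are hypothetical; the sketch only illustrates that the chosen element of $\mathrm{sign}(f)$ sums to zero when $f$ has median zero.

```python
import numpy as np

def subgradient_v(f):
    """Element v of sign(f) with <v, 1> = 0, following step 6 (f is assumed to have median zero)."""
    n_pos = np.sum(f > 0)
    n_neg = np.sum(f < 0)
    n_zero = np.sum(f == 0)
    v = np.sign(f).astype(float)
    if n_zero > 0:
        # Spread the imbalance between positive and negative entries over the zero entries.
        v[f == 0] = -(n_pos - n_neg) / n_zero
    return v

f = np.array([2.0, 0.5, 0.0, 0.0, -1.0])   # example vector with median zero
v = subgradient_v(f)
```

For this example the two zero entries each receive the value $-1/2$, so that the entries of `v` sum to zero and stay in $[-1, 1]$.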



Lemma 4.1 The sequence $f^k$ produced by Algorithm 3 satisfies $F_1(f^k) > F_1(f^{k+1})$ for all $k \geq 0$
or the sequence terminates.

Proof: Note that, analogously to the proof of Lemma 3.1, we can conclude that the inner objective
is nonpositive at the optimum, where the sequence terminates if the optimal value is zero as the
previous $f^k$ is among the minimizers of the inner problem. Now observe that the objective of the
inner optimization problem is invariant under addition of a constant. This follows from the fact that
we always have $\langle v^k, \mathbf{1} \rangle = 0$, which can be easily verified. Hence, with $R(f) = \frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|$,
we get

$$R(f^{k+1}) - \lambda^k \langle f^{k+1}, v^k \rangle = R(g^{k+1}) - \lambda^k \langle g^{k+1}, v^k \rangle < 0.$$

Dividing both sides by $\|f^{k+1}\|_1$ yields

$$\frac{R(f^{k+1})}{\|f^{k+1}\|_1} - \lambda^k \frac{\langle f^{k+1}, v^k \rangle}{\|f^{k+1}\|_1} < 0,$$

and with $\langle f^{k+1}, v^k \rangle \leq \|f^{k+1}\|_1 \|v^k\|_\infty = \|f^{k+1}\|_1$, the result follows. □



Theorem 4.1 The sequence $f^k$ produced by Algorithm 3 converges to an eigenvector $f^*$ of the
1-Laplacian with eigenvalue $\lambda^* \in [h_{\mathrm{RCC}}, F_1(f^0)]$. Furthermore, $F_1(f^k) > F_1(f^{k+1})$ for all $k \geq 0$
or the sequence terminates.

Proof: Note that every constant vector $u_0$ satisfies $\Phi_{f^k}(u_0) = 0$ as $\langle v^k, \mathbf{1} \rangle = 0$. The minimum
of $\Phi_{f^k}$ is either negative or the sequence terminates, in which case the previous non-constant $g^k$ is
a minimizer. In any case $g^{k+1}$ cannot be constant and in turn $f^{k+1}$ is nonconstant and has median
zero. Thus for all $k$,

$$F_1(f^k) \geq \min_{\substack{f \text{ nonconstant} \\ \mathrm{median}(f) = 0}} F_1(f) = h_{\mathrm{RCC}},$$

where we use that the median of $f^k$ is zero. Thus $F_1(f^k)$ is lower-bounded by $h_{\mathrm{RCC}}$. Note that
$h_{\mathrm{RCC}} \leq \lambda_2$. We can conclude now analogously to Theorem 3.1 that the sequence $F_1(f^k)$ converges
to some limit

$$\lambda^* = \lim_{k \to \infty} F_1(f^k) \geq h_{\mathrm{RCC}}.$$

As in Theorem 3.1 the compactness of the set containing the sequence $g^k$ implies the existence of a
convergent subsequence $g^{k_j}$, and using the fact that subtracting the median is continuous we have
$\lim_{j \to \infty} f^{k_j} = g^* - \mathrm{median}(g^*) \mathbf{1} =: f^*$. The proof now proceeds analogously to Theorem 3.1. □



4.2 Quality guarantee for 1-spectral clustering

Even though we cannot guarantee that we obtain the optimal ratio Cheeger cut, we can guarantee
that 1-spectral clustering always leads to a ratio Cheeger cut at least as good as the one found by
standard spectral clustering. Let $(C^*_f, \overline{C^*_f})$ be the partition of $V$ obtained by optimal thresholding of
$f$, where $C^*_f = \arg\min_{C^t_f} \mathrm{RCC}(C^t_f, \overline{C^t_f})$, and for $t \in \mathbb{R}$, $C^t_f = \{i \in V \mid f_i > t\}$. Furthermore, $\mathbf{1}_C$
denotes the vector which is 1 on $C$ and 0 else.

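Lemma 4.2 Let $C, \overline{C}$ be a partitioning of the vertex set $V$, and assume that $|C| \leq |\overline{C}|$. Then for
any vector $f \in \mathbb{R}^n$ of the form $f = \alpha \mathbf{1}_C$, where $\alpha \in \mathbb{R}$, it holds that $F_1(f) = \mathrm{RCC}(C, \overline{C})$.

Proof: As $F_1$ is scale invariant, we can without loss of generality assume that $\alpha = 1$. Then we have

$$F_1(f) = \frac{\frac{1}{2} \sum_{i,j=1}^n w_{ij} |f_i - f_j|}{\|f\|_1} = \frac{\frac{1}{2} \sum_{i \in C,\, j \in \overline{C}} w_{ij} + \frac{1}{2} \sum_{i \in \overline{C},\, j \in C} w_{ij}}{|C|} = \frac{\mathrm{cut}(C, \overline{C})}{|C|} = \frac{\mathrm{cut}(C, \overline{C})}{\min\{|C|, |\overline{C}|\}} = \mathrm{RCC}(C, \overline{C}).$$

□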
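Lemma 4.2 is easy to check numerically. The sketch below assumes a small random symmetric weight matrix; the helper names `F1` and `rcc` are hypothetical and simply implement the definitions of $F_1$ and of the ratio Cheeger cut given above.

```python
import numpy as np

def F1(f, W):
    """F_1(f) = (1/2) * sum_ij w_ij |f_i - f_j| / ||f||_1."""
    diffs = np.abs(f[:, None] - f[None, :])
    return 0.5 * np.sum(W * diffs) / np.sum(np.abs(f))

def rcc(mask, W):
    """Ratio Cheeger cut of the partition (C, complement of C); C is given as a boolean mask."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut / min(mask.sum(), (~mask).sum())

rng = np.random.default_rng(0)
n = 7
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                      # symmetric weights with zero diagonal

mask = np.zeros(n, dtype=bool)
mask[:3] = True                  # |C| = 3 <= |complement of C| = 4
f = 2.5 * mask.astype(float)     # f = alpha * 1_C with alpha = 2.5
```

Evaluating `F1(f, W)` and `rcc(mask, W)` gives the same value up to rounding, independently of the nonzero scaling `alpha`.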



Lemma 4.3 Let f G R™ with median(/) = 0, and C = argmin{|Cy |, \Cj\}. Then the vector 
f* = l c satisfies F 1 (f) > F^f*). 

Proof: Denote by /+ : V — > R the function / ( + := max{0, fi}, and analogously, let / _ := 
max{0, —fi}. Then we have 

Note that we always have | — fj~ — fj~ + fj | = | f^ — ff | + | /r — fj | , which can easily be 
verified by performing a case distinction over the signs of fi and fj . Eq. ( fT5| i can now be written as 

R (f) = lJ2 w a \tf - //I + \jl w a \sr - sj I = R(.n + R(n . 

i,3 id 

Using the fact that || /|| 1 can be decomposed as ||/ + || 1 + || /~ || i, we obtain 

R{f) R(f + ) + R{r) > . Ji?(/+) R(f-)\ .... 

m = \\fn 1 + \\f-\\ 1 - mm {jn'JfTJ- (16) 



ii 



The last inequality follows from the fact that we always have for a, b,c,d> 0, 



a + b . { a b 
> mm 



c + d ( c d 

which can be easily shown by contradiction. Let wlog min {f , 3} = § , and assume that |i| < § ■ 
This implies - > k> which is a contradiction to - < |. Note that median(/) = 0, hence we have 

E argmin^ |/i - c| , 
which implies that G dj^iev anc ^ nence there exist coefficients |o;,| < 1 such that 

= XI si s n (/o + E q " 

n^o /i=o 



which is equivalent to 



> 0}| - \{i,fi < 0}| < \{i,fi = 0}|. This inequality implies that 



ft > 0}\ < ^ and /, < 0}| < ^. We now rewrite as follows: 

$$R(f^+) = \frac{1}{2}\sum_{i,j} w_{ij}|f_i^+ - f_j^+| = \sum_{f_i^+ > f_j^+} w_{ij} \int_{f_j^+}^{f_i^+} 1\, dt = \int_0^\infty \sum_{f_i^+ > t \ge f_j^+} w_{ij}\, dt.$$



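Note that for $t \ge 0$,

$$\sum_{f_i^+ > t \ge f_j^+} w_{ij} = \mathrm{cut}(C_f^t, \overline{C_f^t}) = \frac{\mathrm{cut}(C_f^t, \overline{C_f^t})}{\min\{|C_f^t|, |\overline{C_f^t}|\}} \cdot |C_f^t| \ge \mathrm{RCC}(C_f^*, \overline{C_f^*}) \cdot |C_f^t|,$$

where in the second step we used that $|C_f^t| = |\{i, f_i > t\}| \le |\{i, f_i > 0\}| \le \frac{|V|}{2}$. Hence we have

$$R(f^+) \ge \int_0^\infty \mathrm{RCC}(C_f^*, \overline{C_f^*}) \cdot |C_f^t|\, dt = \mathrm{RCC}(C_f^*, \overline{C_f^*}) \int_0^\infty \sum_{f_i > t} 1\, dt = \mathrm{RCC}(C_f^*, \overline{C_f^*}) \sum_{f_i > 0} \int_0^{f_i} 1\, dt = \mathrm{RCC}(C_f^*, \overline{C_f^*})\, \|f^+\|_1.$$

Hence it holds that $F_1(f^+) \ge \mathrm{RCC}(C_f^*, \overline{C_f^*})$, and analogously one shows that $F_1(f^-) \ge \mathrm{RCC}(C_f^*, \overline{C_f^*})$. Note that $\mathrm{RCC}(C_f^*, \overline{C_f^*}) = \mathrm{RCC}(C, \overline{C}) = F_1(f^*)$, by Lemma 4.2. Combining this with Eq. (16) yields the result. □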



Theorem 4.2 Let $u$ denote the second eigenvector of the standard graph Laplacian, and $f$ denote the result of Algorithm 3 after initializing with the vector $\frac{1}{|C|}\mathbf{1}_C$, where $C = \arg\min\{|C_u^*|, |\overline{C_u^*}|\}$. Then $\mathrm{RCC}(C_u^*, \overline{C_u^*}) \ge \mathrm{RCC}(C_f^*, \overline{C_f^*})$.



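Proof: Using Lemma 4.1 and 4.2, we have the following chain of inequalities:

$$\mathrm{RCC}(C_u^*, \overline{C_u^*}) = F_1\Big(\frac{1}{|C|}\mathbf{1}_C\Big) \ge F_1(f).$$

With $C_f := \arg\min\{|C_f^*|, |\overline{C_f^*}|\}$, we obtain by Lemma 4.3 and 4.2

$$F_1(f) \overset{4.3}{\ge} F_1(\mathbf{1}_{C_f}) \overset{4.2}{=} \mathrm{RCC}(C_f^*, \overline{C_f^*}).$$

□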
4.3 Solution of the inner problem

The inner problem is convex, thus a solution can be computed by any standard method for solving convex nonsmooth programs, e.g. subgradient methods [4]. However, in this particular case we can exploit the structure of the problem and use the equivalent dual formulation of the inner problem.

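Lemma 4.4 Let $E \subset V \times V$ denote the set of edges and $A : \mathbb{R}^E \to \mathbb{R}^V$ be defined as $(A\alpha)_i = \sum_{j \mid (i,j) \in E} w_{ij}\alpha_{ij}$. The inner problem is equivalent to

$$\min_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \Psi(\alpha) := \|A\alpha - F(f^k) v^k\|_2^2.$$

The Lipschitz constant of the gradient of $\Psi$ is upper bounded by $2 \max_r \sum_{s=1}^n w_{rs}^2$.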
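Proof: First, we note that

$$\frac{1}{2}\sum_{i,j=1}^n w_{ij}|u_i - u_j| = \max_{\{\beta \in \mathbb{R}^E \,\mid\, \|\beta\|_\infty \le 1\}} \frac{1}{2}\sum_{(i,j) \in E} w_{ij}(u_i - u_j)\beta_{ij}.$$

Introducing the new variable $\alpha_{ij} = \frac{\beta_{ij} - \beta_{ji}}{2}$, this can be rewritten as

$$\max_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \sum_{(i,j) \in E} w_{ij} u_i \alpha_{ij} = \max_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \langle u, A\alpha \rangle,$$

where we have introduced the notation $(A\alpha)_i = \sum_{j \mid (i,j) \in E} w_{ij}\alpha_{ij}$. Both $u$ and $\alpha$ are constrained to lie in non-empty compact, convex sets, and thus we can reformulate the inner objective by the standard min-max-theorem (see e.g. Corollary 37.3.2. in [18]) as follows:

$$\begin{aligned}
&\min_{\|u\|_2 \le 1}\ \max_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \langle u, A\alpha \rangle - F(f^k)\langle u, v^k \rangle \\
={}& \max_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}}\ \min_{\|u\|_2 \le 1} \langle u, A\alpha - F(f^k) v^k \rangle \\
={}& \max_{\{\alpha \in \mathbb{R}^E \,\mid\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} -\|A\alpha - F(f^k) v^k\|_2.
\end{aligned}$$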

In the last step we have used that the solution of the minimization of the linear function over the Euclidean unit ball is given by

$$u^* = -\frac{A\alpha - F(f^k)v^k}{\|A\alpha - F(f^k)v^k\|_2},$$

if $\|A\alpha - F(f^k)v^k\|_2 \ne 0$ and otherwise $u^*$ is an arbitrary element of the Euclidean unit ball. Transforming the maximization problem into a minimization problem finishes the proof of the first statement. Regarding the Lipschitz constant, a straightforward computation shows that

$$(\nabla\Psi(\alpha))_{rs} = 2 w_{rs}\Big(\sum_{j \mid (r,j) \in E} w_{rj}\alpha_{rj} - F(f^k) v^k_r\Big).$$

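Thus,

$$\begin{aligned}
\|\nabla\Psi(\alpha) - \nabla\Psi(\beta)\|_2^2 &= 4 \sum_{(r,s) \in E} w_{rs}^2 \Big(\sum_{j \mid (r,j) \in E} w_{rj}(\alpha_{rj} - \beta_{rj})\Big)^2 \\
&\le 4 \sum_{(r,s) \in E} w_{rs}^2 \Big(\sum_{j \mid (r,j) \in E} w_{rj}^2\Big) \sum_{i \mid (r,i) \in E} (\alpha_{ri} - \beta_{ri})^2 \\
&= 4 \sum_{r=1}^n \Big(\sum_{s \mid (r,s) \in E} w_{rs}^2\Big)^2 \sum_{i \mid (r,i) \in E} (\alpha_{ri} - \beta_{ri})^2 \\
&\le 4 \Big(\max_r \sum_{s=1}^n w_{rs}^2\Big)^2 \sum_{(r,i) \in E} (\alpha_{ri} - \beta_{ri})^2.
\end{aligned}$$

□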
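The quantities appearing in Lemma 4.4 are straightforward to compute. The following Python sketch evaluates $A\alpha$, the gradient of $\Psi$ and the Lipschitz bound $2\max_r \sum_s w_{rs}^2$ on a toy graph, and checks the Lipschitz inequality for one pair of points. The edge-dictionary representation, the function names and the vector `c` (standing in for $F(f^k)v^k$) are assumptions for illustration.

```python
import math


def apply_A(edges, alpha, n):
    """(A alpha)_i = sum over edges (i, j) of w_ij * alpha_ij."""
    out = [0.0] * n
    for (i, j), w in edges.items():
        out[i] += w * alpha[(i, j)]
    return out


def grad_psi(edges, alpha, c, n):
    """Gradient of Psi(alpha) = ||A alpha - c||_2^2, where c plays the role of F(f^k) v^k."""
    resid = [a - ci for a, ci in zip(apply_A(edges, alpha, n), c)]
    return {(r, s): 2.0 * w * resid[r] for (r, s), w in edges.items()}


def lipschitz_bound(edges, n):
    """Upper bound 2 * max_r sum_s w_rs^2 on the Lipschitz constant of grad Psi."""
    row = [0.0] * n
    for (r, _), w in edges.items():
        row[r] += w * w
    return 2.0 * max(row)


def norm(d):
    """Euclidean norm of an edge-indexed vector."""
    return math.sqrt(sum(x * x for x in d.values()))


# Toy data (assumption): path graph 0-1-2 with weights 1 and 2, stored in both directions.
n = 3
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0}
c = [0.3, -0.2, 0.1]
alpha = {(0, 1): 0.5, (1, 0): -0.5, (1, 2): 1.0, (2, 1): -1.0}
beta = {(0, 1): -0.2, (1, 0): 0.2, (1, 2): 0.4, (2, 1): -0.4}

L = lipschitz_bound(edges, n)
ga, gb = grad_psi(edges, alpha, c, n), grad_psi(edges, beta, c, n)
grad_diff = norm({e: ga[e] - gb[e] for e in edges})
step_diff = norm({e: alpha[e] - beta[e] for e in edges})
```

Because the bound only needs one pass over the edge weights, it is cheap to obtain before running a first-order method on the dual problem.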
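Algorithm 4 Solution of the dual inner problem with FISTA
1: Input: Lipschitz-constant $L$ of $\nabla\Psi$,
2: Initialization: $t_1 = 1$, $\alpha^1 \in \mathbb{R}^E$,
3: repeat
4:   $\beta^{k+1} = P\big(\alpha^k - \frac{1}{L}\nabla\Psi(\alpha^k)\big)$, where $P$ denotes the projection onto the feasible set of the dual problem,
5:   $t_{k+1} = \frac{1 + \sqrt{1 + 4 t_k^2}}{2}$,
6:   $\alpha^{k+1} = \beta^{k+1} + \frac{t_k - 1}{t_{k+1}}\big(\beta^{k+1} - \beta^k\big)$,
7: until the gap between the original and the dual problem is smaller than $\epsilon$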



Compared to the primal problem, the objective of the dual problem is smooth. Moreover, it can be efficiently solved using FISTA [3], a two-step subgradient method with guaranteed convergence rate $O(1/k^2)$, where $k$ is the number of steps. The only input of FISTA is an upper bound on the Lipschitz constant of the gradient of the objective. FISTA provides a good solution in a few steps which guarantees descent in functional (13) and thus makes the modified IPM very fast. The resulting algorithm is shown in Alg. 4.
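The FISTA iteration itself is short. The sketch below is a generic projected FISTA for a smooth objective under the box constraint $\|x\|_\infty \le 1$, written under simplifying assumptions: it takes an arbitrary gradient callable, runs a fixed number of steps instead of monitoring the duality gap, and does not enforce the antisymmetry constraint $\alpha_{ij} = -\alpha_{ji}$ of the dual problem. The name `fista_box` and the toy objective are hypothetical.

```python
import math


def fista_box(grad, L, x0, num_steps=200):
    """FISTA for min Psi(x) subject to -1 <= x_i <= 1.

    grad: callable returning the gradient of Psi at a point (as a list),
    L:    upper bound on the Lipschitz constant of the gradient.
    """
    def clip(v):
        return max(-1.0, min(1.0, v))

    beta_prev = list(x0)
    alpha = list(x0)
    t = 1.0
    for _ in range(num_steps):
        g = grad(alpha)
        # Projected gradient step with step size 1/L.
        beta = [clip(a - gi / L) for a, gi in zip(alpha, g)]
        # Update of the momentum parameter.
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Extrapolation using the two most recent projected iterates.
        alpha = [b + (t - 1.0) / t_next * (b - bp) for b, bp in zip(beta, beta_prev)]
        beta_prev, t = beta, t_next
    return beta_prev


# Toy problem (assumption): Psi(x) = sum_i d_i * (x_i - b_i)^2 on the box [-1, 1]^2.
d = [1.0, 4.0]
b = [2.0, 0.25]
toy_grad = lambda x: [2.0 * di * (xi - bi) for di, xi, bi in zip(d, x, b)]
x_hat = fista_box(toy_grad, L=2.0 * max(d), x0=[0.0, 0.0])
```

For this separable toy objective the constrained minimizer is simply `b` clipped to the box, which gives an easy sanity check of the iteration.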



5 Application 2: Sparse PCA 

Principal Component Analysis (PCA) is a standard technique for dimensionality reduction and data analysis [14]. PCA finds the $k$-dimensional subspace of maximal variance in the data. For $k = 1$, given a data matrix $X \in \mathbb{R}^{n \times p}$ where each column has mean 0, in PCA one computes

$$f^* = \arg\max_{f \in \mathbb{R}^p} \frac{\langle f, X^T X f \rangle}{\|f\|_2^2}, \qquad (17)$$

where the maximizer $f^*$ is the eigenvector corresponding to the largest eigenvalue of the covariance matrix $\Sigma = X^T X \in \mathbb{R}^{p \times p}$. The interpretation of the PCA component $f^*$ is difficult as usually all components are nonzero. In sparse PCA one wants to get a small number of features which still capture most of the variance. For instance, in the case of gene expression data one would like the principal components to consist only of a few significant genes, making them easy for a human to interpret. Thus one needs to enforce sparsity of the PCA component, which yields a trade-off between explained variance and sparsity.

While standard PCA leads to an eigenproblem, adding a constraint on the cardinality, i.e. the number of nonzero coefficients, makes the problem NP-hard. The first approaches performed simple thresholding of the principal components which was shown to be misleading [?]. Since then several methods have been proposed, mainly based on penalizing the $L_1$ norm of the principal components, including SCoTLASS [15] and SPCA [25]. D'Aspremont et al. [11] focused on the $L_0$-constrained formulation and proposed a greedy algorithm to compute a full set of good candidate solutions up to a specified target sparsity, and derived sufficient conditions for a vector to be globally optimal. Moghaddam et al. [17] used branch and bound to compute optimal solutions for small problem instances. Other approaches include D.C. [20] and EM-based methods [19]. Recently, Journée et al. [16] proposed two single unit (computation of one component only) and two block (simultaneous computation of multiple components) methods based on $L_0$-penalization and $L_1$-penalization.

Problem (17) is equivalent to

$$f^* = \arg\min_{f \in \mathbb{R}^p} \frac{\|f\|_2^2}{\langle f, \Sigma f \rangle} = \arg\min_{f \in \mathbb{R}^p} \frac{\|f\|_2}{\|X f\|_2}.$$
In order to enforce sparsity we use instead of the $L_2$-norm a convex combination of an $L_1$ norm and $L_2$ norm in the numerator, which yields the functional

$$F(f) = \frac{(1-\alpha)\|f\|_2 + \alpha\|f\|_1}{\|X f\|_2}, \qquad (18)$$

with sparsity controlling parameter $\alpha \in [0, 1]$. Standard PCA is recovered for $\alpha = 0$, whereas $\alpha = 1$ yields the sparsest non-trivial solution: the component with the maximal variance. One easily sees that the formulation (18) fits in our general framework, as both numerator and denominator are 1-homogeneous functions. The inner problem of the IPM becomes

$$g^{k+1} = \arg\min_{\|f\|_2 \le 1}\ (1-\alpha)\|f\|_2 + \alpha\|f\|_1 - \lambda^k \langle f, \mu^k \rangle, \quad \text{where } \mu^k = \frac{\Sigma f^k}{\sqrt{\langle f^k, \Sigma f^k \rangle}}. \qquad (19)$$

This problem has a closed form solution. In the following we use the notation $x_+ = \max\{0, x\}$.

Lemma 5.1 The convex optimization problem (19) has the analytical solution

$$g_i^{k+1} = \frac{1}{s}\,\mathrm{sign}(\mu_i^k)\,(\lambda^k|\mu_i^k| - \alpha)_+, \quad \text{where } s = \sqrt{\sum_{i=1}^p (\lambda^k|\mu_i^k| - \alpha)_+^2}.$$

Proof: We note that the objective is positively 1-homogeneous and that the optimum is either zero by plugging in the previous iterate or negative in which case the optimum is attained at the boundary. Thus wlog we can assume that at the optimum $\|f\|_2 = 1$. Thus the problem reduces to

$$\min_{\|f\|_2 = 1}\ \alpha\|f\|_1 - \lambda^k \langle f, \mu^k \rangle.$$

First, we derive an equivalent "dual" problem, noting

$$\alpha\|f\|_1 - \lambda^k \langle \mu^k, f \rangle = \max_{\|v\|_\infty \le 1} \langle f, \alpha v - \lambda^k \mu^k \rangle.$$

Using the fact that the objective is convex in $f$ and concave in $v$ and the feasible set is compact, we obtain by the min-max equality:

$$\min_{\|f\|_2 \le 1}\ \max_{\|v\|_\infty \le 1} \langle f, \alpha v - \lambda^k \mu^k \rangle = \max_{\|v\|_\infty \le 1}\ \min_{\|f\|_2 \le 1} \langle f, \alpha v - \lambda^k \mu^k \rangle = \max_{\|v\|_\infty \le 1} -\|\alpha v - \lambda^k \mu^k\|_2.$$

The objective of the dual problem is separable in $v$ and the constraints of $v$ as well. Thus each component can be optimized separately which gives

$$v_i^* = \mathrm{sign}(\mu_i^k)\,\min\Big\{1, \frac{\lambda^k |\mu_i^k|}{\alpha}\Big\}.$$

Using that $f^* = (-\alpha v^* + \lambda^k \mu^k) / \|\lambda^k \mu^k - \alpha v^*\|_2$, we get the solution

$$g_i^{k+1} = \frac{\mathrm{sign}(\mu_i^k)\,(\lambda^k|\mu_i^k| - \alpha)_+}{\sqrt{\sum_{i=1}^p (\lambda^k|\mu_i^k| - \alpha)_+^2}}.$$

□
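The closed form of Lemma 5.1 is a soft-thresholding of $\lambda^k \mu^k$ at level $\alpha$ followed by normalization. The Python sketch below implements it and evaluates the reduced objective $\alpha\|f\|_1 - \lambda^k\langle f, \mu^k\rangle$, which at the solution equals $-s$. The function names and the numerical values are assumptions chosen for illustration.

```python
import math


def inner_solution(mu, lam, alpha):
    """g_i = sign(mu_i) * (lam * |mu_i| - alpha)_+ / s, with s the Euclidean norm."""
    raw = [math.copysign(max(0.0, lam * abs(m) - alpha), m) for m in mu]
    s = math.sqrt(sum(r * r for r in raw))
    if s == 0.0:
        return raw  # every component was thresholded to zero
    return [r / s for r in raw]


def reduced_objective(f, mu, lam, alpha):
    """alpha * ||f||_1 - lam * <f, mu>, the inner objective on the unit sphere."""
    return alpha * sum(abs(x) for x in f) - lam * sum(x * m for x, m in zip(f, mu))


# Hypothetical values for mu^k, lambda^k and the sparsity parameter alpha.
mu = [0.8, -0.5, 0.1]
lam, alpha = 2.0, 0.4
g = inner_solution(mu, lam, alpha)
value = reduced_objective(g, mu, lam, alpha)
```

In this example the third component satisfies $\lambda^k|\mu_3^k| < \alpha$ and is therefore set exactly to zero, which is the mechanism that produces sparse components.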



As $s$ is just a scaling factor, we can omit it and obtain the simple and efficient scheme to compute sparse principal components shown in Algorithm 5. While the derivation is quite different from [16], the resulting algorithms are very similar. The subtle difference is that in our formulation the thresholding parameter of the inner problem depends on the current eigenvalue estimate whereas it is fixed in [16]. Empirically, we observe that our method needs slightly fewer iterations to converge.
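The scheme of Algorithm 5 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `sparse_pca_ipm` and `matvec`, the parameters `max_iter` and `seed`, the Gaussian random initialization and the guard against an all-zero thresholded vector are all choices made here.

```python
import math
import random


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def sparse_pca_ipm(X, alpha, eps=1e-8, max_iter=1000, seed=0):
    """Sketch of the sparse PCA iteration; returns (f, lam) with ||X f||_2 = 1."""
    p = len(X[0])
    Xt = [list(col) for col in zip(*X)]
    norm2 = lambda v: math.sqrt(sum(x * x for x in v))
    norm1 = lambda v: sum(abs(x) for x in v)

    rng = random.Random(seed)
    f = [rng.gauss(0.0, 1.0) for _ in range(p)]
    scale = norm2(matvec(X, f))
    f = [x / scale for x in f]                        # S(f^0) = ||X f^0||_2 = 1
    lam = (1 - alpha) * norm2(f) + alpha * norm1(f)   # lambda^0 = F(f^0)

    for _ in range(max_iter):
        Xf = matvec(X, f)
        nXf = norm2(Xf)
        mu = [m / nXf for m in matvec(Xt, Xf)]        # Sigma f / ||X f||_2
        # Soft-thresholding step; the scaling factor s is omitted.
        g = [math.copysign(max(0.0, lam * abs(m) - alpha), m) for m in mu]
        scale = norm2(matvec(X, g))
        if scale == 0.0:                              # everything thresholded away
            break
        f = [x / scale for x in g]
        lam_new = (1 - alpha) * norm2(f) + alpha * norm1(f)
        converged = abs(lam_new - lam) / lam < eps
        lam = lam_new
        if converged:
            break
    return f, lam


# Toy data (assumption): Sigma = X^T X = diag(9, 1), so for alpha = 0 the
# iteration reduces to the power method and lam tends to 1 / sqrt(9) = 1/3.
X = [[3.0, 0.0], [0.0, 1.0]]
f_pca, lam_pca = sparse_pca_ipm(X, alpha=0.0)
f_sparse, lam_sparse = sparse_pca_ipm(X, alpha=0.5)
```

For $\alpha = 0$ the thresholding is inactive and the loop is a plain power iteration on $\Sigma$, which is a convenient way to sanity-check an implementation before turning on sparsity.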

Algorithm 5 Sparse PCA
1: Input: data matrix $X$, sparsity controlling parameter $\alpha$, accuracy $\epsilon$
2: Initialization: $f^0$ = random with $S(f^0) = 1$, $\lambda^0 = F(f^0)$
3: repeat
4:   $g_i^{k+1} = \mathrm{sign}(\mu_i^k)\,(\lambda^k|\mu_i^k| - \alpha)_+$
5:   $f^{k+1} = g^{k+1} / \|X g^{k+1}\|_2$
6:   $\lambda^{k+1} = (1-\alpha)\|f^{k+1}\|_2 + \alpha\|f^{k+1}\|_1$
7:   $\mu^{k+1} = \Sigma f^{k+1} / \|X f^{k+1}\|_2$
8: until $|\lambda^{k+1} - \lambda^k| / \lambda^k < \epsilon$

6 Experiments

1-Spectral Clustering: We compare our IPM with the total variation (TV) based algorithm by [21], p-spectral clustering with $p = 1.1$ [6] as well as standard spectral clustering with optimal thresholding the second eigenvector of the graph Laplacian ($p = 2$). The graph and the two-moons dataset are constructed as in [6]. The following table shows the average ratio Cheeger cut (RCC) and error (classification as in [6]) for 100 draws of a two-moons dataset with 2000 points. In the case of the IPM, we use the best result of 10 runs with random initializations and one run initialized with the second eigenvector of the unnormalized graph Laplacian. For [21] we initialize once with the second eigenvector of the normalized graph Laplacian as proposed in [21] and 10 times randomly. IPM and the TV-based method yield similar results, slightly better than 1.1-spectral and clearly outperforming standard spectral clustering. In terms of runtime, IPM and [21] are on the same level.





             Inverse Power Method   Szlam & Bresson [21]   1.1-spectral [6]    Standard spectral
Avg. RCC     0.0195 (± 0.0015)      0.0195 (± 0.0015)      0.0196 (± 0.0016)   0.0247 (± 0.0016)
Avg. error   0.0462 (± 0.0161)      0.0491 (± 0.0181)      0.0578 (± 0.0285)   0.1685 (± 0.0200)




Figure 1: Left and middle: Second eigenvector of the 1-Laplacian and 2-Laplacian, respectively. Right: Relative Variance (relative to maximal possible variance) versus number of non-zero components for the three datasets Lung2, GCM and Prostate1.

Next we perform unnormalized 1-spectral clustering on the full USPS and MNIST-datasets (9298 resp. 70000 points). As clustering criterion we use the multicut version of RCut, given as

$$\mathrm{RCut}(C_1, \dots, C_K) = \sum_{i=1}^K \frac{\mathrm{cut}(C_i, \overline{C_i})}{|C_i|}.$$

We successively subdivide clusters until the desired number of clusters ($K = 10$) is reached. In each substep the eigenvector obtained on the subgraph is thresholded such that the multi-cut criterion is minimized. This recursive partitioning scheme is used for all methods. As in the previous experiment, we perform one run initialized with the thresholded second eigenvector of the unnormalized graph Laplacian in the case of the IPM and with the second eigenvector of the normalized graph Laplacian in the case of [21]. In both cases we add 100 runs with random initializations. The next table shows the obtained RCut and errors.
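For reference, the multicut criterion can be evaluated directly from a weight matrix and a cluster assignment. The sketch below assumes the standard multicut ratio cut, $\sum_{i=1}^K \mathrm{cut}(C_i, \overline{C_i}) / |C_i|$; the function name `multicut_rcut`, the label-list representation and the toy graph are hypothetical.

```python
def multicut_rcut(W, labels):
    """Sum over clusters of cut(C_i, complement of C_i) / |C_i| (assumed form of RCut)."""
    n = len(W)
    total = 0.0
    for c in set(labels):
        members = [i for i in range(n) if labels[i] == c]
        # Weight of all edges leaving cluster c.
        cut = sum(W[i][j] for i in members for j in range(n) if labels[j] != c)
        total += cut / len(members)
    return total


# Toy graph (assumption): triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
good = multicut_rcut(W, [0, 0, 0, 1, 1, 1])   # split along the bridge
bad = multicut_rcut(W, [0, 0, 1, 1, 1, 1])    # split through a triangle
```

In a recursive partitioning scheme such a function would be called for each candidate threshold of the eigenvector on the current subgraph, keeping the split with the smallest value.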







                Inverse Power Method   S.&B. [21]   1.1-spectral [6]   Standard spectral
MNIST   RCut    0.1507                 0.1545       0.1529             0.2252
        Error   0.1244                 0.1318       0.1293             0.1883
USPS    RCut    0.6661                 0.6663       0.6676             0.8180
        Error   0.1349                 0.1309       0.1308             0.1686



Again the three nonlinear eigenvector methods clearly outperform standard spectral clustering. Note that our method requires additional effort (100 runs) but we get better results. For both datasets our method achieves the best RCut. However, if one wants to do only a single run, by Theorem 4.2 for bi-partitions one achieves a cut at least as good as the one of standard spectral clustering if one initializes with the thresholded 2nd eigenvector of the 2-Laplacian.



Sparse PCA: We evaluate our IPM for sparse PCA on gene expression datasets obtained from [?]. We compare with two recent algorithms: the $L_1$ based single-unit power algorithm of [16] as well as the EM-based algorithm in [19]. For all considered datasets, the three methods achieve very similar performance in terms of the tradeoff between explained variance and sparsity of the solution, see Fig. 1 (Right). In fact the results are so similar that for each dataset, the plots of all three methods coincide in one line. In [16] it also has been observed that the best state-of-the-art algorithms produce the same trade-off curve if one uses the same initialization strategy.

Acknowledgments: This work has been supported by the Excellence Cluster on Multimodal Computing and Interaction at Saarland University.